add option to use synthetic input data #1632
Conversation
Hi @alfuyao1986! Thank you for your pull request and welcome to our community.

Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with the CLA signed label.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
Please justify the value of this change, following https://github.com/pytorch/torchtitan/blob/main/CONTRIBUTING.md#proof-of-value
In particular, why is fake data better than the default `c4` / `c4_test`?
```python
def __iter__(self) -> Iterator[Tuple[Dict[str, torch.Tensor], torch.Tensor]]:
    while True:
        inputs = torch.randint(
```
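For context, a complete random-data iterator in this style might look roughly like the sketch below; the class name, the `"input"` key, and the `vocab_size`/`seq_len` defaults are illustrative assumptions, not the PR's actual code.

```python
from typing import Dict, Iterator, Tuple

import torch
from torch.utils.data import IterableDataset


class RandomTokenDataset(IterableDataset):
    """Endless stream of random token ids, useful for dependency-free benchmarking."""

    def __init__(self, vocab_size: int = 32000, seq_len: int = 2048) -> None:
        self.vocab_size = vocab_size
        self.seq_len = seq_len

    def __iter__(self) -> Iterator[Tuple[Dict[str, torch.Tensor], torch.Tensor]]:
        while True:
            # Draw seq_len + 1 random token ids, then split into input / shifted label.
            tokens = torch.randint(0, self.vocab_size, (self.seq_len + 1,))
            yield {"input": tokens[:-1]}, tokens[1:]
```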
This is fake data, not "synthetic" data.
Oh, how about calling it random data? The goal is to remove the dataset dependency for quick performance benchmarking.
The default c4 has two problems:
- Although it is small, the user still has to download it before testing, and with rapid debugging and reruns it is possible to hit the HF request limit. An unstable network also gets in the way of smooth development. I had to make local changes like this so I could develop without worrying about the dataset; I suspect many users have had a similar experience.
- With larger models and bigger batch sizes it will quickly loop back over the data, and for the same reason as above, downloading a very large dataset may be rate-limited or time-consuming for many users.
A random dataset is usually very useful when debugging the CPU overhead introduced by data loading, though I'm not sure we already have such a use case. Multimodal may benefit from a random dataset.
> Although it is small, the user still has to download it before testing, and with rapid debugging and reruns it is possible to hit the HF request limit. An unstable network also gets in the way of smooth development. I had to make local changes like this so I could develop without worrying about the dataset; I suspect many users have had a similar experience.
We have `c4_test` stored in the repo: https://github.com/pytorch/torchtitan/tree/main/tests/assets/c4_test
> With larger models and bigger batch sizes it will quickly loop back over the data, and for the same reason as above, downloading a very large dataset may be rate-limited or time-consuming for many users.
What would be the advantage of using random / fake data versus looping back on `c4_test`?
> Multimodal may benefit from a random dataset.
As we don't have multimodal training, the main thing I'd like to understand is what the benefit of adding random data is on top of the existing `c4_test`.
A random dataset can generally skip the overhead of data loading, such as actually reading from disk. This is not related to whether the dataset is large or small. But as mentioned above, this may become more useful once dataloader overhead starts to be a significant factor. As for development efficiency, I haven't encountered such an issue, so I'm not the one to answer that.
This is solely my opinion.
Oh, given that c4_test is already pre-stored in the repo, it should be fine for most cases. I am actually completely fine with using the pre-stored c4_test dataset. Just two more considerations to bring up for discussion:
- A random dataset can usually stress the whole stack better, numerically and computationally, than a small repeated dataset, but it is debatable whether this additional stress is practically realistic and necessary.
- Other frameworks (MaxText, Megatron-LM) do provide "synthetic/mock" data options for fast benchmarking, so from an ease-of-comparison point of view it may be better to have a matching option.
Got it, thanks for the context!
From my perspective, the value of this dataset is somewhat limited, given that we already have c4_test, which doesn't involve randomness and so has become a standard way to do numerical testing even when parallelism / world size changes.
That said, if people have strong opinions about adding this dataset, I'm OK with it, too. If that's the case, I would suggest making a new builder function & file, instead of piggybacking on the existing build_hf_dataloader. I understand that would make it harder to switch to this new dataset from the config, but that's not a good reason to reuse it.
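To illustrate how a separate builder could still stay switchable from the config, a thin dispatcher might look like the sketch below; the registry, the builder names, and the stub bodies are hypothetical and torchtitan's actual mechanism may differ.

```python
from typing import Callable, Dict


def build_hf_dataloader(**kwargs):
    # Stub standing in for the existing HuggingFace-backed builder.
    raise NotImplementedError


def build_random_dataloader(**kwargs):
    # Stub standing in for a new, standalone random-data builder in its own file.
    raise NotImplementedError


# Hypothetical mapping from the config's dataset name to a builder function.
DATALOADER_BUILDERS: Dict[str, Callable] = {
    "c4": build_hf_dataloader,
    "c4_test": build_hf_dataloader,
    "random": build_random_dataloader,
}


def build_dataloader(dataset_name: str, **kwargs):
    # A thin dispatch keeps the builders separate while the config switch stays easy.
    return DATALOADER_BUILDERS[dataset_name](**kwargs)
```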
Yes, we definitely shouldn't use build_hf_dataloader for a random dataset. There is actually another benefit of a random dataset (when it has a deterministic option): debugging checkpoint issues. Given that the dataloader is controlled by another package, having a random dataset with a deterministic option would make debugging checkpoint inconsistencies easier; at least we could rule out dataset/dataloader problems.
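A deterministic variant could seed a private generator and expose its state for checkpointing; the sketch below is only an illustration of that idea, with hypothetical names, not something this PR implements.

```python
import torch
from torch.utils.data import IterableDataset


class DeterministicRandomTokens(IterableDataset):
    """Random token stream that is reproducible across restarts via an explicit seed."""

    def __init__(self, vocab_size: int, seq_len: int, seed: int = 0) -> None:
        self.vocab_size, self.seq_len = vocab_size, seq_len
        self.generator = torch.Generator().manual_seed(seed)

    def __iter__(self):
        while True:
            tokens = torch.randint(
                0, self.vocab_size, (self.seq_len + 1,), generator=self.generator
            )
            yield {"input": tokens[:-1]}, tokens[1:]

    # Hypothetical hooks so the generator state can be saved/restored alongside a
    # checkpoint, letting one rule out the dataloader when chasing inconsistencies.
    def state_dict(self):
        return {"rng_state": self.generator.get_state()}

    def load_state_dict(self, state):
        self.generator.set_state(state["rng_state"])
```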
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
No description provided.